Finding Persistent Items in Data Streams
Abstract
Frequent item mining, which deals with finding items that occur frequently in a given data stream over a period of time, is one of the most heavily studied problems in data stream mining. A generalized version of frequent item mining is persistent item mining, where a persistent item, unlike a frequent item, does not necessarily occur more frequently than other items over a short period of time, but rather persists and occurs more frequently over a long period of time. To the best of our knowledge, there is no prior work on mining persistent items in a data stream. In this paper, we address the fundamental problem of finding persistent items in a given data stream during a given period of time at any given observation point. We propose a novel scheme, PIE, that can accurately identify each persistent item with a false negative rate (FNR) no higher than any desired target while using a very small amount of memory. The key idea of PIE is that it uses Raptor codes to encode the ID of each item that appears at the observation point during a measurement period and stores only a few bits of the encoded ID in the memory of that observation point during that measurement period. A persistent item occurs in enough measurement periods that enough encoded bits of its ID can be retrieved from the observation point to decode the ID correctly. We implemented and extensively evaluated PIE using three real network traffic traces and compared its performance with two prior adapted schemes. Our results show that not only does PIE achieve the desired FNR in every scenario, but its FNR is also, on average, 19.5 times smaller than the FNR of the best adapted prior art.
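The encode-a-few-bits-per-period idea can be illustrated with a minimal sketch. It is not the PIE implementation: it swaps Raptor codes for a plain random linear fountain code over GF(2), and it assumes the stored bits have already been attributed to a single item, which skips the data structure PIE uses at the observation point. The names `ID_BITS`, `BITS_PER_PERIOD`, `mask_for`, `encode`, and `decode` are hypothetical.

```python
import random

ID_BITS = 32          # assumed width of an item ID (hypothetical)
BITS_PER_PERIOD = 4   # assumed number of encoded bits kept per period (hypothetical)


def parity(x):
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1


def mask_for(period, j):
    """Pseudo-random non-empty subset of ID bit positions, fixed by (period, j)."""
    return random.Random(period * 1_000_003 + j).randrange(1, 1 << ID_BITS)


def encode(item_id, period):
    """Encoded bits stored for item_id in one period: parities of random bit subsets."""
    return [parity(item_id & mask_for(period, j)) for j in range(BITS_PER_PERIOD)]


def decode(stored):
    """Recover the ID from {period: bits}, or return None if too few bits were seen."""
    basis = {}  # highest set bit of a reduced equation -> (mask, parity bit)
    for period, bits in stored.items():
        for j, bit in enumerate(bits):
            mask = mask_for(period, j)
            # Gaussian elimination over GF(2): reduce against existing pivots.
            while mask:
                top = mask.bit_length() - 1
                if top not in basis:
                    basis[top] = (mask, bit)
                    break
                pivot_mask, pivot_bit = basis[top]
                mask ^= pivot_mask
                bit ^= pivot_bit
    if len(basis) < ID_BITS:
        return None  # not enough independent equations yet
    item_id = 0
    # Back-substitution: each pivot row only involves bits at or below its pivot.
    for top in range(ID_BITS):
        mask, bit = basis[top]
        lower = mask ^ (1 << top)
        item_id |= (bit ^ parity(item_id & lower)) << top
    return item_id


item = 0xC0A80117
seen_briefly = {t: encode(item, t) for t in range(3)}    # 12 bits in total
seen_often = {t: encode(item, t) for t in range(20)}     # 80 bits in total
print(decode(seen_briefly), decode(seen_often) == item)
```

In this sketch an item seen in only a few periods contributes fewer equations than there are ID bits, so it cannot be decoded, while an item seen in many periods yields a solvable system with high probability. That mirrors the property the abstract describes, though the memory layout and the code actually used by PIE differ.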
Similar Resources
Monitoring persistent items in the union of distributed streams
A persistent item in a stream is one that occurs regularly in the stream without necessarily contributing significantly to the volume of the stream. Persistent items are often associated with anomalies in network streams, such as botnet traffic and click fraud. While it is important to track persistent items in an online manner, it is challenging to zero in on such items in a massive distribute...
CR-precis: A Deterministic Summary Structure for Update Data Streams
We present deterministic sub-linear space algorithms for problems over update data streams, including estimating frequencies of items and ranges, finding approximate frequent items and approximate φ-quantiles, estimating inner products, constructing near-optimal B-bucket histograms, and estimating entropy. We also present improved lower bound results for several problems over update data streams.
Finding Frequent Items over General Update Streams
We present novel space and time-efficient algorithms for finding frequent items over general update streams. Our algorithms are based on a novel adaptation of the popular dyadic intervals method for finding frequent items. The algorithms improve upon existing algorithms in both theory and practice.
A nearly optimal and deterministic summary structure for update data streams
We present a deterministic summary structure over update streams that enables deterministic and the first space-optimal algorithms for a variety of problems, including estimating frequencies, finding approximate frequent items, finding approximate quantiles, finding hierarchical heavy hitters, approximately optimal B-bucket histograms, estimating inner product sizes, etc.
PTS: Projected Topological Stream clustering algorithm
Clustering high-dimensional data streams is an attractive research topic, as several applications generate a high number of attributes, bringing new challenges in terms of partitioning due to the curse of dimensionality. In addition, those applications produce unbounded sequences of data which cannot be stored for later analysis. Despite the importance of this scenario, there ar...
Journal:
- PVLDB
Volume 10
Publication date: 2016